Handle int64 columns with missing data in SQL Lab#8226
betodealmeida merged 9 commits into apache:master
superset/dataframe.py
Outdated
    hasattr(dtype, "type")
    and issubclass(dtype.type, np.generic)
    and np.issubdtype(dtype, np.number)
    and dtype._is_numeric
Actually, this method is Pandas specific. I'll have to use this and the previous one together.
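To illustrate what using the two checks together could look like, here is a minimal sketch. It is not the code from this PR: the helper name `is_numeric` is hypothetical, and it swaps the private `dtype._is_numeric` attribute for the public `pd.api.types.is_numeric_dtype`, on the assumption that the public function is an acceptable stand-in for Pandas extension dtypes.

```python
import numpy as np
import pandas as pd


def is_numeric(dtype) -> bool:
    """Hypothetical helper: True for numeric NumPy or Pandas dtypes."""
    # Plain NumPy dtypes: use the NumPy type hierarchy.
    if isinstance(dtype, np.dtype):
        return bool(np.issubdtype(dtype, np.number))
    # Pandas extension dtypes (e.g. the nullable Int64): use the public
    # Pandas check instead of the private `_is_numeric` attribute.
    return bool(pd.api.types.is_numeric_dtype(dtype))


numpy_int = is_numeric(np.dtype("int64"))      # NumPy integer dtype
numpy_obj = is_numeric(np.dtype("object"))     # NumPy object dtype
pandas_int = is_numeric(pd.Int64Dtype())       # Pandas nullable integer dtype
```

Under these assumptions, `numpy_int` and `pandas_int` are true and `numpy_obj` is false, so both kinds of dtype go through a single entry point.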
@robdiciuccio ^^^
The dataframe implementation looks good, as does the Presto engine dtype fix, but does this fully address #8225 if only Presto is handled? Are there other databases this fix should be implemented for (even in a separate PR)? Also curious about your thoughts on the feasibility of the PyArrow workaround here.
You're right, this only fixes Presto. I'll do a separate PR addressing the other DBs.
The PyArrow workaround looks like it would solve our problem, but I don't know whether it would be better to monkey-patch PyArrow (if that can even be done) or to create a light wrapper around it.
CATEGORY
Choose one
This PR fixes #8225.
SUMMARY
When a column has int64 integers and missing data, Pandas will cast it to float64, resulting in loss of precision and possibly returning incorrect numbers.

This PR fixes the bug by adding a method to the DB engine specs that returns a dtype based on the cursor description, currently implemented in Presto only. With the dtype, we can create a Pandas Series for each column, and create a DataFrame that has the proper types.

Note that in order to represent the column correctly we need to use a nullable data type, introduced in Pandas 0.24.0. Unfortunately, PyArrow is unable to serialize the resulting data frame, so msgpack has to be disabled.

TEST PLAN
Added unit test.
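For reference, the precision loss described in the summary, and the effect of building each column with an explicit nullable dtype, can be reproduced with the sketch below. It is an illustration under stated assumptions, not the code added by this PR: the `TYPE_MAP` table, the `dtype_for` helper, and the sample cursor description are all hypothetical, and the `"bigint"` type code is assumed to be what the driver reports for 64-bit integer columns.

```python
import pandas as pd

# Hypothetical mapping from a driver's type code to a Pandas dtype.
TYPE_MAP = {"bigint": "Int64"}


def dtype_for(column_description):
    """Hypothetical helper: pick a dtype from one DB-API description entry."""
    type_code = column_description[1]  # DB-API: (name, type_code, ...)
    return TYPE_MAP.get(type_code)     # None lets Pandas infer the dtype


# Hypothetical result set: one BIGINT column with a missing value.
big = 2**53 + 1  # 9007199254740993, not exactly representable as float64
description = [("id", "bigint", None, None, None, None, None)]
rows = [(big,), (None,)]
values = [row[0] for row in rows]

# Default inference: the missing value forces float64 and rounds the integer.
as_float = pd.Series(values)

# Explicit nullable integer dtype (Pandas 0.24.0 or later) keeps it exact.
as_int = pd.Series(values, dtype=dtype_for(description[0]))
df = pd.DataFrame({"id": as_int})
```

Here `as_float` ends up as float64 and no longer holds the original integer, while `df["id"]` keeps the exact value alongside a missing entry.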
ADDITIONAL INFORMATION
REVIEWERS
@villebro @robdiciuccio